I.  INTRODUCTION 


This final report covers the results obtained under
Contract NCNR4467(00). Two areas of investigation, related
in  basic  concept  but  disparate  in  approach  and  application, 
have  been  considered  in  this  program.  The  initial  effort 
involved  a  conceptual  modeling  of  learning  and  self-organizing 
systems  in  information  theoretic  terms.  The  second  effort 
entailed  transferring  the  conceptual  form  developed  in  the 
initial  study  into  a  control-system  framework  and  resulted  in 
an  attractive  form  of  model-reference  control  system. 

Publications  and  Lectures: 

Mr. Malcolm R. Uffelman gave a series of lectures on
learning machines in the "Pattern Recognition-Models, Learning,
Decision Theory" seminar conducted by the Information Sciences
Institute of the University of Maryland, June 28 through July 1,
1965. Mr. Uffelman also published a paper, "Learning Systems
and Information Theory" (1), in the 1966 IEEE International
Communication Conference, Philadelphia, Pennsylvania, June 1966.


1.  Reproduced  in  the  Appendix  to  this  report. 
II. TECHNICAL DISCUSSION


Part I. Information Theory and Learning Systems


In  this  part  of  the  report,  we  shall  consider  the  basic 
forms  and  functions  of  learning  systems.  Learning  systems 
can  be  considered  on  three  levels  of  complexity  and  all  three 
are  defined  herein.  However,  it  is  shown  that  the  adaptive 
system  forms  the  core  of  each  level  and  it  is  to  this  system 
that  most  of  our  attention  is  directed.  A  theorem  specifying 
the  necessary  order  of  complexity  of  an  adaptive  system  is 
presented and proven, and some of its implementations are
discussed.

Definitions:

1.  Trained  System:  A  system  which  learns  to  perform 
a  desired  task  through  some  training  procedure,  but  whose 
internal  state  is  frozen  at  the  completion  of  the  training 
process. An example of a trained system is a linear threshold
device made up of fixed resistor weights where the values
of  the  resistors  used  are  determined  via  a  least-mean-square 
training  algorithm  using  typical  inputs.  The  learning  process 
can  be  performed  off  line. 

2.  Adaptive  System:  A  system  that  learns  to  perform 
a  desired  task  through  some  training  procedure  and  retains 
the  ability  to  learn  throughout  the  life  of  the  system.  The 
continued  ability  to  learn  implies  the  ability  to  improve 
performance via further training (i.e., on-the-job training),
to unlearn tasks, and to learn new or additional tasks. The
numerous examples of adaptive systems include the CONFLEX I,
MINOS II, and the MARK I PERCEPTRON.

3.  Self-Organizing  System:  An  adaptive  system  coupled 
with  an  automatic  evaluation  system  and  a  built-in  set  of 
goals.  The  purpose  of  the  evaluator  is  to  direct  the  adaptive 
portion  of  the  self-organizing  system  so  that  it  develops  a 
set of responses satisfying the goals.* Examples of
self-organizing systems include the Homeostat and the MIT
model-reference control system.

4.  Learning  Systems:  Systems  which  can  learn  (with  or 
without a teacher) to do jobs. Types of learning systems are
trained, adaptive, and self-organizing systems.

Information  Theoretic  Models: 

Assume  the  existence  of  a  trained  system.  To  introduce 
this  approach  to  modeling  learning  systems,  assume  that  the 
function to be performed is print reading (multifont).
Figure 1 shows the information theoretic model to be employed.


Figure 1. Simple Information Theory Model


*  It  should  be  noted  that  the  goal  system  itself  can  be 
a  self-organizing  system  that  develops  the  higher  goals 
of  the  overall  system  based  on  primitive  goals.  In  fact, 
the  final  goal  system  can  be  composed  of  a  hierarchy  of 
self-organizing  systems. 




The  source,  called  the  environment,  produces  outputs 
that  are  the  basic  concepts  involved  in  the  problem  at  hand. 
In  other  words,  the  environment  output  is  not  reality  but 
the  abstract  essence  of  physical  reality.  For  example, 
when the environment output is some letter, say "B", it
introduces  into  the  channel  only  the  concept  "B"  and  not 
an  IBM  ELITE  BACKSLANT  "B"  or  a  PICA  bold  face  "B"  or  any 
other  physical  representation  of  "B". 

For  the  message  (i.e.,  the  output  of  the  environment) 
to  reach  the  receiver,  here  taken  as  a  trained  system,  it 
must  be  transmitted  through  a  channel.  As  noted  in  figure  1, 
the  channel  is  a  noisy  one,  that  is,  it  has  equivocation. 

The  output  of  the  channel  is  a  corrupted  version  of  the 
environment  output.  In  our  model,  the  channel  output  has 
physical  meaning  and  attributes. 

The environment defined above is quite like the philosophy
which Bishop Berkeley, an eighteenth century Irish
philosopher, put forth to refute materialism: matter does
not exist except as a bundle of perceptions. Ultimate
reality  is  the  concept  of  the  perceived  reality.  It  is  not 
necessary  to  accept  this  philosophy  to  use  the  model  being 
proposed.  The  important  idea  is  that  all  sensing  machines, 
man  included,  can  work  only  with  signals  taken  at  the  output 
of  the  channel;  but  the  general  objective  of  the  sensing 
machine  and  its  superior  parts,  the  combination  of  these 
being  the  trained  system  in  our  example,  is  to  produce  as 
an output the original concept or something functionally
related  to  the  original  concept.  In  our  print  reader  example, 
the  environment  output  might  be  the  "R"  concept.  Due  to  the 
equivocation  of  the  channel,  the  input  to  the  trained  system 
might  be  a  bold  face  block  "R"  with  a  broken  cusp  and  surrounded 
with  carbon  smudges.  The  output  of  the  trained  system  should 
be  a  code  representing  "R",  without  any  indication  of  font  or 
condition  of  print,  in  other  words,  the  original  concept. 
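
The source, channel, and receiver roles described above can be sketched in a few lines. This is only an illustration under stated assumptions: the 8-element prototypes, the fixed number of inverted elements, and the correlation receiver are hypothetical stand-ins chosen for the sketch and are not taken from the report.

```python
import random

# Hypothetical 8-element prototypes standing in for the concepts "B" and "R".
PROTOTYPES = {
    "B": (1, 1, 1, 1, 1, 1, 1, 1),
    "R": (-1, -1, -1, -1, -1, -1, -1, -1),
}


def environment(rng):
    """Source: emits an abstract concept, not a physical pattern."""
    return rng.choice(sorted(PROTOTYPES))


def channel(concept, rng, flips=2):
    """Noisy channel: gives the concept a physical form and inverts
    `flips` randomly chosen elements (the smudges and broken strokes)."""
    pattern = list(PROTOTYPES[concept])
    for index in rng.sample(range(len(pattern)), flips):
        pattern[index] = -pattern[index]
    return tuple(pattern)


def trained_system(pattern):
    """Receiver: returns the concept whose prototype correlates best."""
    def correlation(concept):
        return sum(p * q for p, q in zip(PROTOTYPES[concept], pattern))
    return max(PROTOTYPES, key=correlation)


rng = random.Random(0)
trials = []
for _ in range(100):
    sent = environment(rng)
    received = trained_system(channel(sent, rng))
    trials.append((sent, received))
```

With only two of eight elements inverted, the corrupted pattern always remains closer to its own prototype, so the receiver output carries the concept and no trace of the corruption.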

A  trained  system  is  merely  an  adaptive  system  that  has 
been  taught  a  job  and  then  had  the  ability  to  change  internal 
states  destroyed.  Figure  2  shows  the  information  theoretic 
model  of  an  adaptive  system. 


Figure  2.  Information  Theory  Model,  Adaptive  System 


The  model  is  essentially  the  same  as  that  for  the 
trained system except that the receiver is now an adaptive
system.  The  small  arrow  in  the  corner  signifies  that  changes 
can  be  made  in  its  internal  state.  The  superstructure  shown 
in dashed lines is the trainer and its lines of information
flow and control. All information paths and control paths
need not be present in every situation. As shown, the
trainer  knows  the  truth  (i.e.,  the  environment  output), 
physical  reality  (i.e.,  the  channel  output),  and  the 
system  response;  it  can  also  control  the  environment  and 
the  state  of  the  adaptive  system.  The  trainer  and  its 
lines  of  communication  are  shown  as  dashed  lines  to  remind 
us  that  they  (or  some  part  of  them)  are  needed  only  when 
training  is  taking  place  or  when  the  system  performance 
is being monitored. The various combinations of information
and control, such as AB23, A13, or A23 can each offer
an  interesting  study  into  the  behavior  of  the  system.  However, 
for  the  time  being  these  studies  will  be  postponed. 


Figure  3  shows  the  model  for  a  self-organizing  system. 


Figure 3. Information Theory Model, Self-Organizing System


Again,  dashed  lines  show  information  and  control 
lines  for  the  trainer.  (In  a  self-organizing  system,  the 
dashed  lines  more  likely  show  the  observer.)  In  the  main 
structure, the combination of the evaluator and the goals
is termed the goal system. The input to the goal system
from the environment is through a channel, C', which is in
general different from channel C. The goal system also has
an input, different from C and C', from the adaptive system
output through channel C". All channels can introduce noise.
It  is  the  purpose  of  the  goal  system  to  consider  the  response 
of  the  adaptive  system  and  change  the  internal  structure  of 
the  adaptive  system  as  required  to  make  the  responses  tend 
to  satisfy  the  goals. 

It should be noted that the goal system itself can
be a self-organizing system based on more primitive goals.
Such an organization would allow the main self-organizing
system to develop higher-level goals based on the primitive
goals established for the goal system.

Also  notice  that  when  a  trainer  is  used  to  give 
instructions  to  the  self-organizing  system,  the  trainer 
does  not  have  direct  control  of  the  adaptive  portion  as  it 
did  in  the  strictly  adaptive  system  model  of  figure  2.  In 
figure  3,  the  trainer  can  only  affect  the  decision  of  the 
goal  system.  This  is  as  it  is  in  a  biological  teacher-student 
situation. 


Analysis: 


Assume  the  existence  of  the  trained  system  illustrated 
in  figure  1.  The  purpose  of  the  system  is  to  produce  outputs 
related  to  the  environment  outputs.  Without  loss  of  generality, 
the  system  can  be  assumed  to  be  a  pattern-recognition  system 
with  the  function  of  producing  an  output  that  is  a  coded  form 
of  the  environment  output.  In  general,  the  environment  has 
a  limited  repertoire  of  outputs. 


The output function of the environment can be represented
as $P(E_i)$, the probability that output $E_i$ will occur at a
given time (a discrete ergodic source is assumed)*. The $E_i$
are, as stated before, the concept of class; in other words,
the $E_i$ are the classes (or categories) to which the outputs
of the channel belong and, in effect, each channel output is
a noisy member of $E_i$. For simplicity, $P(E_i)$ at time $t_k$ is
taken to be independent of $P(E_i)$ at $t_{k-1}$ (i.e., the same
as selection with replacement). Consequently, $P(E_i)$ is the
a priori probability of class $E_i$.

The output of the channel can be represented by $P(S_j \mid E_i)$,
the probability of the physical stimulus, $S_j$, given the concept,
$E_i$. Thus, as stated above, the channel introduces noise via
a mapping of the concept into physical reality. The trained
system must take measurements on the physical stimulus, and
based on these measurements produce the concept as an output.
Thus, the function of the receiver (in this case, the trained
system) is to remove noise.

* A discrete source is assumed for convenience; an ergodic
source is assumed because: (a) nature must have a high degree
of stationarity or how could anything learn about it, and (b)
nature must have its main concepts remain fixed over the
ensemble or, again, how could anything learn about it?

Assume  that  the  form  taken  by  physical  reality  at  the 
channel  output  is  a  binary  code;  this  removes  the  problem 
of  noise  being  introduced  by  the  measurements. 

At this point, let us summarize. The function performed
by a learning system is the removal of noise; thus, a learning
system, after training (another name for designing and
debugging), is a filter. After a filter is designed and working,
it is of small interest. The design is the interesting part
of a filter's life. Consequently, let us now turn our attention
to the training of an adaptive system.


As done previously, we define the input pattern to the
adaptive device as an array of n binary variables (1, -1).
The set of patterns $(S_{11}, S_{12}, \ldots, S_{rn})$ represents the corrupted
versions of the classes $(C_1, C_2, \ldots, C_r)$. During training, the
adaptive device is adjusted so that it maps any input, $S_{ij}$,
onto the proper class concept, $C_i$. "Adjustment" means finding,
by some procedure, an internal state of the device that performs
the desired filtering function.


If we restrict our attention to a two-class problem,
the number of possible filter functions (i.e., dichotomies)
for N patterns is $2^N$. In other words, with two concepts
being emitted by the environment and with the noisy channel
producing N patterns in response to the two concepts, the
variability of the problem confronting the adaptive device
is $2^N$. Since we can physically grasp the signals only at
the input of the adaptive device, this variability is the
effective source variability. We call it the transfer
variability, $V_t$, which can be considered the number of
transfer states possible for N patterns. The base 2 logarithm
of $V_t$ is called the transfer entropy, $H_t$. By some training
procedure, we hope to adjust the adaptive device so that
the output of the device, given the desired transfer state
and the input pattern, $S_q$, is completely predictable by an
outside observer.
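
As a small check on the counting used above, the following sketch enumerates the dichotomies of N patterns and takes the base 2 logarithm of their number. The function names are illustrative only and do not come from the report.

```python
import math
from itertools import product


def transfer_variability(num_patterns):
    """Number of two-class labelings (dichotomies) of N patterns: 2**N."""
    return 2 ** num_patterns


def transfer_entropy_bits(num_patterns):
    """Base 2 logarithm of the transfer variability."""
    return math.log2(transfer_variability(num_patterns))


def enumerate_dichotomies(num_patterns):
    """Every assignment of the labels (+1, -1) to N patterns."""
    return list(product((1, -1), repeat=num_patterns))


for n_patterns in (1, 2, 3, 4):
    print(n_patterns,
          transfer_variability(n_patterns),
          transfer_entropy_bits(n_patterns))
```

The entropy in bits equals N itself, which is why the later discussion can read the required capacity directly from the number of patterns.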

The  adaptive  device  has  a  number  of  possible  internal 
states,  each  a  different  decision  surface.  We  must  be  careful 
to  distinguish  between  distinct  internal  states  and  the  number 
of  structural  states.  For  example,  in  a  simple  linear-threshold 
device  having  two  inputs,  we  can  have  more  than  one  set  of 
weights  (i.e.,  more  than  one  structural  state)  forming  the  same 
separating  plane.  Each  different  set  of  weights  is  a  component 
of  the  number  of  structural  states,  but  taken  as  a  group,  the 
weights form only one distinct internal state. The base 2
logarithm of the number of internal states is called the
adaptive capacity, $H_a$, of the classifier.


THEOREM:

For an adaptive classifier to be able to perform all
of the possible dichotomies of N patterns, it is necessary
for $H_a$ to at least equal $H_t$.

PROOF:

Assume that some form of adaptive classifier can achieve
perfect classification for any and all dichotomies of N
patterns with an adaptive capacity, $H_a$, less than the transfer
entropy, $H_t$. This means, of course, that the classifier can
dichotomize the patterns using fewer than $2^N$ internal states.
If a table is made that relates each internal state to the
dichotomy performed, there will be one or more internal states
having more than one dichotomy listed with it. Therefore, either
each such internal state can form more than one decision surface
(i.e., can perform more than one dichotomy), or the dichotomies
related to each of those internal states are the same. Both of the
conclusions contradict the definitions; therefore, the initial
assumption is wrong, and no form of adaptive classifier can
perform all possible dichotomies if $H_a$ is less than $H_t$.
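
The distinction between structural states and distinct internal states, and the counting argument of the proof, can be checked by brute force on a very small device. The sketch below assumes a two-input linear threshold unit with a bias weight and three allowed weight levels (-1, 0, 1); these choices are made only for illustration and are not taken from the report.

```python
import math
from itertools import product

# All four patterns of two binary (+1/-1) inputs.
PATTERNS = list(product((1, -1), repeat=2))


def dichotomy(weights):
    """Labels given to PATTERNS by the sign of w0 + w1*x1 + w2*x2."""
    w0, w1, w2 = weights
    return tuple(1 if w0 + w1 * x1 + w2 * x2 >= 0 else -1
                 for x1, x2 in PATTERNS)


def count_states(levels):
    """Return (number of structural states, set of distinct dichotomies)."""
    settings = list(product(levels, repeat=3))
    return len(settings), {dichotomy(w) for w in settings}


structural, internal = count_states((-1, 0, 1))
adaptive_capacity = math.log2(len(internal))   # H_a in bits
transfer_entropy = float(len(PATTERNS))        # H_t = log2(2**4) = 4 bits

print("structural states:", structural)
print("distinct internal states:", len(internal))
print("H_a = %.2f bits, H_t = %.0f bits" % (adaptive_capacity, transfer_entropy))
```

Many weight settings collapse onto the same dichotomy, so the adaptive capacity is set by the distinct internal states and falls short of the 4 bits needed for all 16 dichotomies; the exclusive-or labeling, for instance, never appears.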

Application: 

Let  us  consider  the  application  of  the  theorem  stated 
above  to  a  popular  form  of  adaptive  classifier,  the  linear 
threshold  classifier  shown  in  figure  4  on  the  following  page.* 


* Another form of adaptive classifier is also analyzed in the
appendix.

Figure 4. Linear Threshold Classifier


First let us assume, for the moment, that all of the
dichotomies of N equal $2^n$ patterns are possible. Thus,
the number of transfer states (or dichotomies) is $2^{2^n}$ and

$$H_t = \log_2 2^{2^n} = 2^n \ \text{bits} \qquad (1)$$

Therefore, the device must have an adaptive capacity
of $2^n$ bits. Since there would be no reason to expect any
weight, $W_i$, to need more range than any other weight, $W_j$,
we can find the capacity required by each weight, $H_w$:

$$H_w = \frac{2^n}{n+1} \ \text{bits per weight.} \qquad (2)$$
Of course, not all of the dichotomies are possible
using a linear threshold classifier. Cameron (1) has
established the upper bound on the number of linearly separable
dichotomies of $2^n$ patterns, R(n), as

$$R(n) < \sqrt{\frac{2}{\pi n}} \left( \frac{e\, 2^n}{n} \right)^n \qquad (3)$$

If we do not attempt impossible dichotomies, the true $H_t$
is $\log_2 R(n)$, so that

$$H_t < \tfrac{1}{2} \log_2 \left( \frac{2}{\pi n} \right) + n \log_2 \left( \frac{e\, 2^n}{n} \right) \qquad (4)$$

$$= \tfrac{1}{2} \left[ \log_2 2 - \log_2 \pi n \right] + n \left[ \log_2 e 2^n - \log_2 n \right]$$

$$= \tfrac{1}{2} \left[ 1 - \log_2 \pi n \right] + n \left[ \log_2 e + n - \log_2 n \right]$$

Rearranging the foregoing,

$$H_t < n^2 + n \log_2 e - (n + \tfrac{1}{2}) \log_2 n + \tfrac{1}{2} - \tfrac{1}{2} \log_2 \pi \qquad (5)$$

And, if n is large compared to unity, this is, without great
error,

$$H_t \approx n^2 - n \log_2 n \qquad (6)$$

Therefore, to a reasonable approximation,

$$H_a \ge n^2 - n \log_2 n \qquad (7)$$

Again, having no reason to assume otherwise, we can compute
that each weight needs

$$\frac{n^2 - n \log_2 n}{n + 1} \ \text{bits} \qquad (8)$$

Or, if n is large, the classifier needs about

$$n - \log_2 n \ \text{bits per weight.} \qquad (9)$$
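
To give a feel for the sizes involved, the sketch below evaluates the logarithm of the bound in the form $\tfrac{1}{2}\log_2(2/\pi n) + n\log_2(e\,2^n/n)$, the large-n approximation $n^2 - n\log_2 n$, and the resulting bits per weight for n + 1 weights. The function names are illustrative, and the closed form of the bound is the reading assumed here for equation 3.

```python
import math


def log2_cameron_bound(n):
    """log2 of sqrt(2/(pi*n)) * (e * 2**n / n)**n."""
    return (0.5 * math.log2(2.0 / (math.pi * n))
            + n * (math.log2(math.e) + n - math.log2(n)))


def approx_transfer_entropy(n):
    """Large-n approximation n**2 - n*log2(n)."""
    return n * n - n * math.log2(n)


def bits_per_weight(n):
    """Approximate entropy shared equally among the n + 1 weights."""
    return approx_transfer_entropy(n) / (n + 1)


for n in (4, 8, 16, 32):
    print(n,
          round(log2_cameron_bound(n), 1),
          round(approx_transfer_entropy(n), 1),
          round(bits_per_weight(n), 2),
          round(n - math.log2(n), 2))
```

The approximation drops the $n \log_2 e$ term, so it sits below the full bound for every n; both remain far smaller than the $2^n$ bits required when every dichotomy is demanded.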


Now, let us take a more reasonable approach. Several
investigators (2) have shown that the natural capacity of a
linear threshold device is 2(n+1) patterns. In other words,
if N is equal to or less than 2(n+1), and n is large, a
linear threshold classifier can perform any desired
dichotomy with probability near unity. Let us assume that a
linear-threshold classifier can perform any dichotomy of
N equal to or less than 2(n+1) patterns:

$$H_t = \log_2 2^{2(n+1)} = 2(n+1) \ \text{bits} \qquad (10)$$

therefore, $H_a$ must at least equal 2(n+1) bits. Again, we
can compute the average capacity required by each weight:

$$H_w = \frac{2(n+1)}{(n+1)} = 2 \ \text{bits per weight.} \qquad (11)$$

Notice that this is, by the above development, only
an average value and that it is a necessary condition.

Therefore, we can conclude that to perform dichotomies
involving all possible patterns, a prohibitively large
memory capacity ($(n^2 - n \log_2 n)/(n+1)$ bits per weight) is required.
If a linear-threshold classifier is used within its natural
capacity, the necessary capacity of the weights is quite
modest (2 bits per weight).

Further  Considerations: 

One interesting form of adaptive system based on a
perceptron-like organization is the Multivac (3), which uses
a memory cell having a one-bit capacity (i.e., 1 bit per
weight). Since our theorem says that for n weights, at 1 bit
per weight, the machine can learn, at most, n patterns, let
us consider under what conditions it can learn any dichotomy
of n patterns.

THEOREM: 

Given  an  n-dimensional  space  and  n  binary  patterns  in 
that  space,  then  if  the  patterns  are  linearly  independent,  any 
dichotomy of them can be performed by a modulo 2 threshold
classifier.

PROOF:

Arrange the patterns in an n x n matrix, called the
pattern matrix, with each row being a pattern. The problem
can now be stated as follows: given a pattern matrix with
linearly independent rows, a column matrix containing n binary
elements (called the weight matrix), and a column matrix
containing n binary elements (called the classification matrix),
then for the following:

$$PW = C \pmod 2$$

where: P is the pattern matrix,
W is the weight matrix, and
C is the classification matrix

the elements of W can be uniquely specified for any arrangement
of ones and zeros in the C matrix.

This can be proven as follows. Since the P matrix has
linearly independent rows, its left inverse (Mod 2) exists
(see Peterson, "Error Correcting Codes," Wiley, 1961). Thus,
we can write:

$$W = P^{-1}C \pmod 2$$

which presents a means for computing the elements of W. Since
we have found a way to compute W, the theorem is true.
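
The computation of W in the proof can be sketched as Gauss-Jordan elimination over the integers modulo 2. The routine and the 3 x 3 pattern matrix below are illustrative assumptions only; any pattern matrix whose rows are linearly independent modulo 2 would serve.

```python
from itertools import product


def solve_mod2(pattern_rows, classes):
    """Solve P W = C (mod 2) by Gauss-Jordan elimination.

    Returns the weight vector W, or None when the rows of P are not
    linearly independent modulo 2.
    """
    n = len(pattern_rows)
    # Augmented matrix [P | C], reduced modulo 2.
    aug = [[v % 2 for v in row] + [c % 2]
           for row, c in zip(pattern_rows, classes)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col]), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                # Row addition modulo 2 is an element-wise exclusive-or.
                aug[r] = [a ^ b for a, b in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def classify(pattern, weights):
    """Odd/even threshold unit: parity of the weighted sum."""
    return sum(p * w for p, w in zip(pattern, weights)) % 2


# Hypothetical pattern matrix with rows independent modulo 2.
P = [[1, 1, 0],
     [0, 1, 1],
     [1, 1, 1]]

for C in product((0, 1), repeat=3):
    W = solve_mod2(P, C)
    print(C, W, [classify(row, W) for row in P])
```

Every one of the eight classification columns yields a weight column, and the parity unit reproduces the requested labels, which is the content of the theorem for n = 3.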

The theorem can be translated from Modulo 2 to real positive
numbers by having a threshold unit which decides even or
odd rather than greater than.

Thus, the Multivac cannot, in general, perform every
dichotomy of (n+1) or more patterns, but for n or fewer linearly
independent patterns and an odd-even threshold device it can
perform any dichotomy.


Part  II.  A  Self-Organizing  Control  System 


Figure 3 of Part I shows the model of a self-organizing
system. If we consider only linear systems, notice that as long
as C" communicates to the evaluator the output of the linear
channel C in series with the adaptive system, then their order
can be reversed and adaptation will not be affected. This reversal
is shown in figure 5, where we now call the channel a closed-loop
plant, and the adaptive system is a preprocessor of the input.
Thus, we can translate the general model of the self-organizing
system of Part I into a form of model-reference control system.
The basic concept of this form of system is that the required
adaptation takes place in the preprocessor located in the signal
path.


Figure 5. Block Diagram, Self-Organizing Control System
(blocks: Goal and Evaluator; Closed-Loop Control System)

Two general conditions are imposed for this discussion:
the closed-loop system is unconditionally stable for all
changes in the plant function, and the closed-loop system
is linear.

Initial  Assumptions: 

Since  we  shall  be  concerned  with  using  a  statistical 
measure  to  evaluate  the  behavior  of  the  system,  we  assume 
that  all  signals  are  bounded  real  functions  of  time  and 
that  the  plant  function  and  the  input  signal  are  stationary 
over the time regions of interest. If y(t) is one of the
signals, then $y_T(t)$ is

$$y_T(t) = \begin{cases} y(t), & -T \le t \le T \\ 0, & t < -T,\ t > +T \end{cases} \qquad (12)$$

and the Fourier transform $Y_T(j\omega)$ is defined by

$$Y_T(j\omega) = \frac{1}{2\pi} \int_{-T}^{+T} y_T(t)\, e^{-j\omega t}\, dt \qquad (13)$$

The System:

Figure 6 is a detailed block diagram of the self-organizing
control system.

Figure 6. Detailed Block Diagram
(Model transfer function $M(j\omega)$; preprocessor transfer function
$K(j\omega)$; control system transfer function $P(j\omega)$)

The  Mean  Square  Error  Measure: 


The mean-square error-measure is defined as:

$$\overline{a^2(t)} = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{+T} \left( d_T(t) - c_T(t) \right)^2 dt \qquad (14)$$

where d(t) is the desired system output, and
c(t) is the actual system output.

Following Wiener (4), we shall use this measure to
evaluate the performance error, $a_T(t)$, of Figure 6. Our
method of using the mean-square measure is standard; we
shall at all times attempt to adjust the parameters at our
disposal in such a way as to minimize the mean-square error.

In adopting the mean-square measure, we are stating
that we are willing to accept many small differences between
the desired output and the actual output, but that large
differences are to be heavily penalized. Obviously, there
are cases where this is not a good measure (i.e., cases
for which a miss is as good as a mile).

However,  for  most  applications,  the  simplicity  of 
the  mean-square  measure  makes  it  attractive  enough  to 
use  even  if  it  results  in  suboptimal  goal  achievement.  It 
is  to  these  applications  that  the  system  described  here 
is  addressed. 




The  adaptive  preprocessor  is  a  network  based  on  the 
synthesis procedure of Wiener (4). Its structural form
is  indicated  in  Figure  7. 


Figure 7. Preprocessor


The transfer functions, $L_i(j\omega)$, of the parallel
filters are orthonormal functions (3) such as the Laguerre
functions. Orthonormal functions are defined, for our
purposes, by

$$2\pi \int_{-\infty}^{+\infty} L_i(j\omega)\, L_j^*(j\omega)\, d\omega = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases} \qquad (15)$$

where the asterisk denotes the complex conjugate.


The preprocessor can be used to approximate a
transfer function, $K(j\omega)$, by

$$K(j\omega) = \sum_{i=0}^{N} k_i L_i(j\omega) \qquad (16)$$

where:

$$k_i = 2\pi \int_{-\infty}^{+\infty} K(j\omega)\, L_i^*(j\omega)\, d\omega$$

We shall consider only those sets of functions which
are complete (3).
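
The orthonormality condition and the expansion of a transfer function in a finite number of terms can be checked numerically. The sketch below is written under several assumptions that are not taken from the report: the Laguerre functions are normalized as $L_n(j\omega) = \frac{\sqrt{2p}}{2\pi} \frac{(p - j\omega)^n}{(p + j\omega)^{n+1}}$, inner products carry a factor of $2\pi$, the scale factor is p = 1, and the target $K(j\omega) = 1/(2p + j\omega)$ is a hypothetical example.

```python
import numpy as np

P_SCALE = 1.0   # Laguerre scale factor p (hypothetical value)
M = 4096        # number of quadrature points

# The substitution w = p*tan(theta/2) maps the real axis onto (-pi, pi).
# Midpoints are used so that the end points, where w is infinite, are avoided.
theta = -np.pi + (np.arange(M) + 0.5) * (2 * np.pi / M)
omega = P_SCALE * np.tan(theta / 2)
d_omega = 0.5 * P_SCALE * (1 + np.tan(theta / 2) ** 2) * (2 * np.pi / M)


def laguerre(n, w=omega, p=P_SCALE):
    """Assumed form: sqrt(2p)/(2*pi) * (p - jw)**n / (p + jw)**(n + 1)."""
    s = 1j * w
    return np.sqrt(2 * p) / (2 * np.pi) * ((p - s) / (p + s)) ** n / (p + s)


def inner(f, g):
    """2*pi times the integral of f(jw) * conj(g(jw)) over the real axis."""
    return 2 * np.pi * np.sum(f * np.conj(g) * d_omega)


def coefficients(K, order):
    """Expansion coefficients k_0 .. k_order of K on the Laguerre set."""
    return np.array([inner(K, laguerre(i)) for i in range(order + 1)])


def approximation_error(K, order):
    """Weighted integral-square error left after order + 1 terms."""
    k = coefficients(K, order)
    K_hat = sum(k[i] * laguerre(i) for i in range(order + 1))
    return inner(K - K_hat, K - K_hat).real


gram = np.array([[inner(laguerre(i), laguerre(j)) for j in range(4)]
                 for i in range(4)])
K_target = 1.0 / (2.0 * P_SCALE + 1j * omega)

print(np.round(gram.real, 6))
print([approximation_error(K_target, order) for order in (0, 1, 6)])
```

The Gram matrix comes out as the identity, and the residual error shrinks as terms are added, which is what completeness of the set promises for a square-integrable target.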


Analysis: 


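From Figure 6, we can write

$$A(j\omega) = M(j\omega)R(j\omega) - C(j\omega) = R(j\omega) \left[ M(j\omega) - K(j\omega)P(j\omega) \right] \qquad (17)$$

We desire to find an expression for the mean-square
error $\overline{a^2(t)}$. By using Parseval's theorem, we can express
$\overline{a^2(t)}$ in terms of $A(j\omega)$:

$$\overline{a^2(t)} = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{+T} a_T^2(t)\, dt \qquad (18)$$

$$\overline{a^2(t)} = \lim_{T \to \infty} \frac{1}{2T} \left[ 2\pi \int_{-\infty}^{+\infty} A_T(j\omega)\, A_T^*(j\omega)\, d\omega \right] \qquad (19)$$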
We can simplify the notation by defining that equations used
herein of the form

$$\overline{a^2(t)} = \int_{-\infty}^{+\infty} A(j\omega)\, A^*(j\omega)\, d\omega \qquad (20)$$

are to be interpreted as in Eq. 19.

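Using Equations 17 and 19:

$$\overline{a^2(t)} = \int_{-\infty}^{+\infty} R(j\omega) R^*(j\omega) \Big[ M(j\omega) M^*(j\omega) - M(j\omega) K^*(j\omega) P^*(j\omega) - M^*(j\omega) K(j\omega) P(j\omega) + K(j\omega) P(j\omega) K^*(j\omega) P^*(j\omega) \Big] d\omega \qquad (21)$$

Substitution of Equation 16 into Equation 21 yields

$$\overline{a^2(t)} = \int_{-\infty}^{+\infty} R(j\omega) R^*(j\omega) \Big[ M(j\omega) M^*(j\omega) - M(j\omega) P^*(j\omega) \sum_{i=0}^{N} k_i L_i^*(j\omega) - M^*(j\omega) P(j\omega) \sum_{i=0}^{N} k_i L_i(j\omega) + P(j\omega) P^*(j\omega) \sum_{m=0}^{N} k_m L_m(j\omega) \sum_{n=0}^{N} k_n L_n^*(j\omega) \Big] d\omega \qquad (22)$$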
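Differentiation of Equation 22 with respect to $k_i$ yields

$$\frac{\partial\, \overline{a^2(t)}}{\partial k_i} = \int_{-\infty}^{+\infty} R(j\omega) R^*(j\omega) \Big[ -M(j\omega) P^*(j\omega) L_i^*(j\omega) - M^*(j\omega) P(j\omega) L_i(j\omega) + 2 k_i P(j\omega) P^*(j\omega) L_i(j\omega) L_i^*(j\omega) \Big] d\omega \qquad (23)$$

Differentiating again with respect to $k_i$ yields:

$$\frac{\partial^2\, \overline{a^2(t)}}{\partial k_i^2} = \int_{-\infty}^{+\infty} 2 \Big[ R(j\omega) R^*(j\omega) P(j\omega) P^*(j\omega) L_i(j\omega) L_i^*(j\omega) \Big] d\omega \qquad (24)$$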


From equation 23, we can see that there is a single value
for $k_i$ that will cause $\partial\, \overline{a^2(t)} / \partial k_i$ to equal zero. Further, from
equation 24, we can see that the second derivative of $\overline{a^2(t)}$
with respect to $k_i$ is always positive, indicating that the
single extremum is a minimum point.

Therefore, the mean-squared performance-error for the system
is a simply shaped surface (hyperparabolic) with a single minimum
point. This form, therefore, avoids the problem common to the
other form of model reference systems in which nonsimple
surfaces must be searched.

A  Simple  Search  Procedure 

Although  there  are  several  methods  of  searching  for  and 
finding  the  minimum  of  a  simple  quadratic  surface,  only  one 
will  be  discussed  here. 


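First, let us rewrite equation 23,

$$\frac{\partial\, \overline{a^2(t)}}{\partial k_i} = \int_{-\infty}^{+\infty} R(j\omega) R^*(j\omega) \Big[ -M(j\omega) P^*(j\omega) L_i^*(j\omega) - M^*(j\omega) P(j\omega) L_i(j\omega) \Big] d\omega + 2 k_i \int_{-\infty}^{+\infty} R(j\omega) R^*(j\omega) \Big[ P(j\omega) P^*(j\omega) L_i(j\omega) L_i^*(j\omega) \Big] d\omega \qquad (25)$$

If we perform the indicated integration, we obtain

$$\frac{\partial\, \overline{a^2(t)}}{\partial k_i} = I_1 + 2 k_i I_2 \qquad (26)$$

where $I_1$ and $I_2$ are real numbers.

Integration of equation 24 reveals

$$\frac{\partial^2\, \overline{a^2(t)}}{\partial k_i^2} = I_3 \qquad (27)$$

where $I_3$ is a real number.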

Equation 26 indicates that the minimum mean square
error can be found by independent adjustment of each
parameter, $k_i$. There is a simple way to make this adjustment.
Consider the general form of a quadratic in one variable, x,

$$y = Ax^2 + Bx + C \qquad (29)$$

$$\frac{dy}{dx} = 2Ax + B \qquad (30)$$

Simple algebra tells us that $y_{(min)}$ occurs at

$$x = \frac{-B}{2A} \qquad (31)$$

and that

$$y_{(min)} = C - \frac{B^2}{4A} \qquad (32)$$

If, in general, we know the value of $x_0$ at time $t_0$, the change
in x, $\Delta x$, required to move to $y_{(min)}$ is

$$\Delta x = -\left. \frac{dy/dx}{d^2y/dx^2} \right|_{x_0} = -\frac{2Ax_0 + B}{2A} \qquad (33)$$

Substituting equation 33 in equation 29, we obtain

$$y = A(x_0 + \Delta x)^2 + B(x_0 + \Delta x) + C = A \left( \frac{-B}{2A} \right)^2 + B \left( \frac{-B}{2A} \right) + C \qquad (34)$$

$$y = C - \frac{B^2}{4A} \qquad (35)$$

which is, of course, the same result obtained in equation
32. The extension of this to the multivariable case is obvious.

Thus, the problem of searching for the minimum becomes
a problem of evaluating the first and second partial
derivatives with respect to each $k_i$ and calculating the
$\Delta k_i$ required. Methods for performing these operations
are well known (5).


In summary, one simple search procedure is to sequentially
vary each $k_i$, evaluate the first and second partial derivatives
for each $k_i$, using the data obtained from the variation, and
independently calculate the change required in each $k_i$ to
obtain the minimum mean-square error.
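
The procedure summarized above can be sketched as follows. The error surface used here is a hypothetical separable quadratic with made-up coefficients, and the derivatives are estimated by central differences around the current setting; both are assumptions made for illustration rather than details of the experimental equipment.

```python
def estimate_derivatives(cost, k, i, delta):
    """Central-difference estimates of the first and second partial
    derivatives of the cost with respect to k[i]."""
    up = list(k)
    up[i] += delta
    down = list(k)
    down[i] -= delta
    j0, j_up, j_down = cost(k), cost(up), cost(down)
    first = (j_up - j_down) / (2 * delta)
    second = (j_up - 2 * j0 + j_down) / delta ** 2
    return first, second


def search_step(cost, k, delta=0.1):
    """Vary each k[i] in turn and apply the change -first/second."""
    k = list(k)
    for i in range(len(k)):
        first, second = estimate_derivatives(cost, k, i, delta)
        k[i] -= first / second
    return k


# Hypothetical error surface: C plus the sum of A_i*k_i**2 + B_i*k_i.
A = [2.0, 0.5, 1.0]
B = [-4.0, 1.0, 0.0]
C = 3.0


def cost(k):
    return C + sum(a * x * x + b * x for a, b, x in zip(A, B, k))


k_min = search_step(cost, [5.0, -3.0, 2.0])
print(k_min, cost(k_min))
```

Because the surface is quadratic in each parameter, a single pass lands on $k_i = -B_i / 2A_i$ from any starting point; with noisy measurements the same step would have to be repeated or averaged.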

In practice, two factors must be considered. First, the
actual computation of $\overline{a^2(t)}$ will most likely be performed by
a low-pass filter. Wiener (4) has shown that this yields a good
estimation of the true mean, if the filter time-constant is long
with respect to the reciprocal of the bandwidth of the signal to
be averaged. Second, noisy measurements will prevent an accurate
computation of $\Delta k_i$. This will, in general, result in a failure
to minimize the mean-square error. Repeated application of the
search procedure or averaging the results of several variations
of the $k_i$ can reduce the amount of misadjustment at the minimum.
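
The use of a low-pass filter as a mean-square estimator can be illustrated with a discrete first-order filter acting on the squared signal. The test signal (a unit sine wave at 1 Hz), the sampling interval, and the 5 second time constant are hypothetical values chosen for the sketch.

```python
import math


def low_pass_mean_square(signal, dt, time_constant):
    """First-order low-pass filter applied to the squared signal."""
    alpha = dt / time_constant
    estimate = 0.0
    for sample in signal:
        estimate += alpha * (sample * sample - estimate)
    return estimate


dt = 0.001
samples = [math.sin(2 * math.pi * 1.0 * k * dt) for k in range(50_000)]
estimate = low_pass_mean_square(samples, dt, time_constant=5.0)
print(estimate)
```

The true mean square of a unit sine wave is one half. With the time constant several periods long, the filter output settles close to that value with only a small residual ripple; a much shorter time constant would let the ripple through.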


Experimental  Study: 


An experimental study of the self-organizing control system
was conducted using general-purpose analog computers to simulate
control plants and reference models. A special-purpose analog
computer was built to provide a 10-stage Laguerre network. The
general term of the Laguerre network, $L_n(s)$, is

$$L_n(s) = \frac{\sqrt{2p}}{2\pi}\, \frac{(p - s)^n}{(p + s)^{n+1}} \qquad (36)$$

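This can be rewritten as

$$L_n(s) = \frac{\sqrt{2p}}{2\pi (p + s)} \left( \frac{p - s}{p + s} \right)^n$$

The term $\frac{\sqrt{2p}}{2\pi (p + s)}$ can be realized by the circuit of figure 8.

Figure 8. $L_0(s)$

The term $\frac{p - s}{p + s}$ can be realized by the circuit of figure 9.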

Thus any order function, $L_n(s)$, can be realized by one circuit,
as shown in figure 8, followed by n cascaded circuits, as
shown  in  figure  9.  The  simulator  constructed  for  this  study 
used  a  p  equal  to  '  x  2.0-2  * 
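The cascade realization can be checked numerically. The sketch below
assumes the general term has the form
$L_n(s) = \frac{\sqrt{2p}}{2\pi} (p-s)^n / (p+s)^{n+1}$, builds it as one
low-pass section followed by n all-pass sections, and compares the result
with the closed form at one test frequency. The value of p and the test
frequency are hypothetical, and the function names are illustrative only.

```python
import math


def first_stage(s, p):
    """Low-pass section: sqrt(2p) / (2*pi*(p + s))."""
    return math.sqrt(2 * p) / (2 * math.pi * (p + s))


def all_pass_stage(s, p):
    """All-pass section: (p - s) / (p + s)."""
    return (p - s) / (p + s)


def laguerre_cascade(n, s, p):
    """One low-pass section followed by n cascaded all-pass sections."""
    response = first_stage(s, p)
    for _ in range(n):
        response *= all_pass_stage(s, p)
    return response


def laguerre_closed_form(n, s, p):
    """Direct evaluation of the assumed general term L_n(s)."""
    return (math.sqrt(2 * p) / (2 * math.pi)
            * (p - s) ** n / (p + s) ** (n + 1))


p = 1.0          # hypothetical pole location
s = 1j * 0.7     # hypothetical test point on the imaginary axis
max_difference = max(
    abs(laguerre_cascade(n, s, p) - laguerre_closed_form(n, s, p))
    for n in range(10)   # the ten stages L_0 through L_9
)
all_pass_gain = abs(all_pass_stage(s, p))
```

Each added section changes only the phase at real frequencies, since the
all-pass section has unit gain on the imaginary axis.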




A  variety  of  plants  and  reference  models  were  studied. 
Typical  of  these  is  the  following: 

1.  Plant  open-loop  transfer  function 

K(s)  =  -i 

2.  Reference  model 


The  experimental  set-up  is  shown  in  figure  10. 


Figure 11 shows the mean-square error, $\overline{e^2(t)}$, for the
variation around the minimum for $L_0(t)$, $L_5(t)$, and $L_9(t)$ as a
function of scale gain (i.e., as read off the indicators on the
simulator).

Figure 12 shows $\overline{e^2(t)}$ for the same terms after the
scale gains have been corrected for pot loading. The pot loading
curve is given in figure 13.


These results are typical of all cases tried and show the
quadratic nature of $\overline{e^2(t)}$ as a function of the $k_i$'s.


Figure 11. Mean-square error, $\overline{e^2}$ (volts), versus scale gain
INTRODUCTION

A  learning  system  can  be  broadly  defined  as  a  system  "whose 
actions  are  influenced  by  past  experiences."*  In  this  paper, 
we  shall  restrict  our  attention  to  adaptive  pattern  classifiers, 
a  subclass  of  learning  systems.  It  is  generally  accepted  that 
the  ability  of  adaptive  pattern  classifiers,  either  nonparametric 
or  parametric,  to  classify  correctly  a  set  of  patterns  is  limited 
by  the  structural  or  algorithmic  composition  of  the  system. 

An  information  theoretic  model  of  an  adaptive  classifier  and 
one  limitation  on  performance  are  considered  in  this  paper. 


DISCUSSION 

An elementary information theoretic model for an adaptive pattern
classifier and its input mechanism is shown in Figure 1.

For the purpose of this model, the source (or environment) is
based on the philosophy which Bishop Berkeley, an 18th century
Irish philosopher, put forth to refute materialism: matter does
not exist except as a bundle of perceptions. Ultimate reality
(represented by the source) is the concept of the perceived
reality (represented by the output of the channel). Thus, in
this model, the source produces only the abstract concept of a
class. The perceived representation (i.e., the physical
representation) which appears at the output of the channel is a
corrupted or noisy version of the concept. Notice that the
very process of taking on a physical representation is
considered noise. The purpose of the adaptive classifier is to
produce outputs which are either the concepts or related to the
concepts. In other words, if the classifier is being used as a
font reader, a chain of events might be: the source produces as
an output the concept of "R"; as a result, the channel produces
an "R" in Canterbury Pica with a broken serif; and finally, the
output of the classifier is a binary code representing "R".

Thus,  the  classifier  has,  in  effect,  removed  the  noise  introduced 
by  the  channel  and  returned  the  pure  signal,  the  concept  "R". 

Based on the above, we can adopt the point of view that the
classifier is a filter whose function is to remove noise. The
degree to which it can remove the noise depends on the type and
quantity of the noise and the functional complexity of the
filter. By introducing two factors, akin to source entropy and
channel capacity, it is possible to define a necessary condition
on the ability of the classifier to remove noise.


ANALYSIS 

For simplicity we shall take the channel output to be binary.
If there are N patterns to be classified into 2 classes, there
are $2^N$ possible dichotomies. In other words, the training
amounts to adjusting the classifier to perform one of $2^N$
connective functions. The number of possible dichotomies is called
the transfer variability, and the base 2 logarithm of this number
is called the transfer entropy, $H_t$. Notice that $H_t$ is analogous
to source entropy in that it is a measure of the variability
forced upon the system.

The adaptive classifier has a number of possible internal states,
each such state being a different decision surface. The training
process involves finding the internal state which satisfies
the desired dichotomy. The base 2 logarithm of the number of
internal states is called the adaptive capacity, $H_s$, of the
classifier.

THEOREM:

For an adaptive classifier to be able to perform all of the
possible dichotomies of N patterns, it is necessary for $H_s$ to at
least equal $H_t$.


The proof of this involves assuming that a classifier can be
trained to perform all dichotomies of N patterns where $H_s$ is
less than $H_t$. This leads to the conclusion that one internal
state can perform more than one dichotomy of the patterns or
that two or more of the dichotomies are identical. Both results
contradict the definitions.
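The quantities in the theorem can be computed directly for small cases.
The sketch below is illustrative only; the function names are hypothetical
and the check expresses nothing more than the necessary condition that the
adaptive capacity be at least the transfer entropy.

```python
import math


def transfer_entropy(num_patterns):
    """H_t: base 2 logarithm of the 2**N dichotomies of N patterns."""
    return math.log2(2 ** num_patterns)


def adaptive_capacity(num_states):
    """H_s: base 2 logarithm of the number of internal states."""
    return math.log2(num_states)


def satisfies_necessary_condition(num_states, num_patterns):
    """True when H_s is at least H_t, as the theorem requires."""
    return adaptive_capacity(num_states) >= transfer_entropy(num_patterns)


def max_distinct_dichotomies(num_states, num_patterns):
    """Each internal state performs one dichotomy, so the number of
    distinct dichotomies cannot exceed the number of states."""
    return min(num_states, 2 ** num_patterns)


enough = satisfies_necessary_condition(num_states=16, num_patterns=4)
too_few = satisfies_necessary_condition(num_states=15, num_patterns=4)
reachable = max_distinct_dichotomies(num_states=15, num_patterns=4)
```

With 15 states and 4 patterns at most 15 of the 16 dichotomies can be
reached, which is the counting argument behind the proof.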

Now let us consider the application of this theorem to two
well-known adaptive classifiers, the SOBLN (2) and the linear
threshold classifier.

The SOBLN is a logical connective system having, in general,
n binary inputs. As an example, a SOBLN for n = 3 is shown
schematically in Figure 2. It computes the $2^n$ logical products
and logically combines them using an OR gate and transmission
weights to produce the output. It can be seen that the SOBLN
can realize all dichotomies of N patterns up to the maximum N
of $2^n$. The maximum $H_t$ is

$$H_t = \log_2 2^N = \log_2 2^{2^n} = 2^n \text{ bits}$$

Thus, $H_s$ by the theorem must be at least $2^n$ bits. The SOBLN
contains $2^n$ weights, $W_i$, which form the variable portion of its
structure. Since there is no reason to expect any weight to
require more variability than any other, we can establish the
variability necessary for each weight by

$$\frac{H_s}{\text{number of weights}} \ge \frac{H_t}{\text{number of weights}} = \frac{2^n}{2^n} = 1 \text{ bit per weight}$$

Therefore, at least one bit per weight is necessary. And in
this case, we can see by direct enumeration that one bit per
weight is also sufficient.
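The direct enumeration mentioned above can be carried out for a small
network. The sketch below models the SOBLN as described in the text, as
$2^n$ logical products combined through an OR gate, with each transmission
weight restricted to one bit. The function names and the choice n = 3 are
illustrative assumptions.

```python
from itertools import product


def sobln_output(weights, inputs):
    """OR of the logical products whose one-bit weight is set.

    weights[j] gates the j-th logical product (minterm) of the inputs.
    """
    n = len(inputs)
    for index, minterm in enumerate(product((0, 1), repeat=n)):
        if weights[index] and tuple(inputs) == minterm:
            return 1
    return 0


def realizable_dichotomies(n):
    """Enumerate every one-bit weight setting and collect the distinct
    dichotomies of the 2**n input patterns that result."""
    patterns = list(product((0, 1), repeat=n))
    tables = set()
    for weights in product((0, 1), repeat=2 ** n):
        tables.add(tuple(sobln_output(weights, x) for x in patterns))
    return tables


n = 3
tables = realizable_dichotomies(n)
num_realized = len(tables)
num_possible = 2 ** (2 ** n)
```

For n = 3 the 256 weight settings produce 256 distinct dichotomies, so
one bit per weight realizes every dichotomy of the 8 patterns.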

Now let us proceed to the linear threshold classifier; an
example is depicted in Figure 3. We know that for n binary
inputs and N = $2^n$ patterns, the classifier cannot perform all
$2^{2^n}$ dichotomies. However, it has been shown by a number of
investigators that the "natural capacity" (1) of a linear threshold
device is 2(n + 1). In other words, if N is equal to or less
than 2(n + 1) the probability that all dichotomies can be
performed is approximately unity (for large n), and it is
approximately zero if N is greater than 2(n + 1). Again taking a
maximum case, we shall for the moment assume that for any 2(n + 1)
binary patterns all dichotomies can be performed.

Thus,

$$H_t = \log_2 2^{2(n+1)} = 2(n + 1) \text{ bits}$$


And again, since we have no reason to suspect that any weight
will require more flexibility than any other weight, we find
that

$$\frac{H_s}{\text{number of weights}} \ge \frac{H_t}{\text{number of weights}} = \frac{2(n + 1)}{n + 1} = 2 \text{ bits per weight}$$

Thus, a linear threshold classifier must have at least 2 bits
per weight to be able to perform all of the possible dichotomies
within its "natural capacity" of 2(n + 1) patterns.
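The statement that a linear threshold classifier cannot perform all
$2^{2^n}$ dichotomies of the $2^n$ binary patterns can be checked by
enumeration for n = 2. The sketch below is illustrative only: the
quantized weight and threshold levels are hypothetical choices, and the
function names do not come from the paper.

```python
from itertools import product


def threshold_output(weights, bias, inputs):
    """Output 1 when the weighted sum plus bias is positive."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def threshold_dichotomies(n, weight_levels, bias_levels):
    """Collect the distinct dichotomies of the 2**n binary patterns
    reachable with the given quantized weight and bias levels."""
    patterns = list(product((0, 1), repeat=n))
    tables = set()
    for weights in product(weight_levels, repeat=n):
        for bias in bias_levels:
            tables.add(tuple(threshold_output(weights, bias, x)
                             for x in patterns))
    return tables


n = 2
tables = threshold_dichotomies(
    n,
    weight_levels=(-1, 0, 1),             # hypothetical weight levels
    bias_levels=(-1.5, -0.5, 0.5, 1.5),   # hypothetical threshold levels
)
num_realized = len(tables)

# Patterns are ordered (0,0), (0,1), (1,0), (1,1).
exclusive_or = (0, 1, 1, 0)
xor_realized = exclusive_or in tables

# Bits per weight from the bound above: 2(n + 1) bits over n + 1 weights.
bits_per_weight = 2 * (n + 1) / (n + 1)
```

Fourteen of the sixteen dichotomies are reached; the two missing ones are
exclusive-OR and its complement, which no choice of weights can separate.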

CONCLUSION 

A necessary condition for an adaptive classifier to be able to be
trained for a given task has been developed. In at least one
case, the SOBLN, the theorem also leads to a sufficient condition.
For the linear threshold classifier it has been shown experimentally
and theoretically that two bits or less per weight will, in
many cases, do as well in a dichotomy problem as weights with
greater variability (3). This leads us to believe that for a linear
threshold classifier with N equal to or less than 2(n + 1), and
for those realizable dichotomies of the patterns, 2 bits per
weight are sufficient. This belief is, of course, still in the
form of a conjecture, but we are working on its proof.
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